January 20, 2014

General Description
===================

	The files included below contain the resequenced versions of the D. simulans (Tsimbazaza 
strain; Hollocher et al. 2000) and D. sechellia (Reference strain; 14021-0231.36) genomes, as well 
as genomes from D. melanogaster strains zhr and z30 (McManus et al. 2014, accepted), in FASTA format. 
Chain files are also included to convert coordinates on these genomes to those of the D. melanogaster 
reference version 3 (dm3), using the liftOver tool available on the UCSC genome browser website 
(http://hgdownload.soe.ucsc.edu/admin/exe/). Custom Perl scripts are also included to convert 
coordinates from the D. simulans and D. sechellia genomes into dm3 coordinates (as this is a 
multi-step process). The liftOver tool must be installed and on your PATH to use these Perl scripts.

!!! Also note that your BED file names must end with “.bed” or “.BED” to use the custom Perl scripts.

	Test data have also been included to try out the custom perl scripts and liftOver chains for 
D. sechellia and D. simulans. Fastq files with the raw sequence reads are also present.

Genome resequencing
===================

	Genomic sequence reads from D. sechellia, D. simulans, and D. melanogaster were aligned 
to the droSec1, droSim1, and dm3 assembly releases, respectively, using 
BWA (version 0.5.6) (Li and Durbin 2010). Reads were aligned separately using default parameters 
and merged using the BWA sampe command. The resulting SAM format files were converted to BAM format, 
and SNPs and indels were called using SAMtools (version 0.1.7a; commands view, sort, and pileup) 
(Li et al. 2009). SNPs and indels were filtered using the samtools.pl varFilter command to retain 
variants with phred quality scores greater than 20 (estimated 1% error). A custom Perl script, 
snp_adder.pl, was used to produce strain-specific genomes. This script modifies reference genome 
sequence to incorporate the filtered variants. Insertion / deletion positions were recorded and 
used to produce custom chain files to use with the UCSC liftOver script.
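
The filtering and incorporation steps described above can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not snp_adder.pl itself; the variant layout (0-based position, reference allele, alternate allele, Phred quality) and all function names are assumptions made for illustration. It shows how a Phred score relates to an error probability (Q = 20 corresponds to 1%), how variants scoring above 20 are kept, and how indel positions can be recorded as cumulative length changes.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def filter_variants(variants, min_qual=20):
    """Keep variants whose quality is strictly greater than min_qual."""
    return [v for v in variants if v[3] > min_qual]


def apply_variants(reference, variants):
    """Return (new_sequence, shifts) after applying non-overlapping variants.

    Each variant is a hypothetical (pos, ref, alt, qual) tuple with a 0-based pos.
    shifts holds one (reference position just after the indel,
    cumulative length change) pair per indel, in reference order.
    """
    pieces = []
    shifts = []
    cursor = 0   # next reference base not yet copied
    offset = 0   # running length difference, new sequence minus reference
    for pos, ref, alt, _qual in sorted(variants):
        if pos < cursor:
            raise ValueError(f"overlapping variant at {pos}")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # substituted or inserted bases
        cursor = pos + len(ref)
        if len(alt) != len(ref):              # indel: record where coordinates shift
            offset += len(alt) - len(ref)
            shifts.append((cursor, offset))
    pieces.append(reference[cursor:])
    return "".join(pieces), shifts


toy_reference = "ACGTACGT"
toy_variants = [
    (1, "C", "T", 45),     # SNP
    (3, "T", "TGG", 30),   # 2 bp insertion
    (5, "C", "A", 12),     # low quality, removed by the filter
    (6, "GT", "G", 60),    # 1 bp deletion
]
strain_sequence, indel_shifts = apply_variants(
    toy_reference, filter_variants(toy_variants)
)
```

The recorded shifts are the kind of information needed to write a chain file relating the reference and strain-specific coordinate systems.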

	For D. sechellia and D. simulans, gDNA sequence reads were remapped to the strain-specific 
genomes. Non-mappable read-pairs were assembled into contigs using Velvet (version 1.0.15; parameters: 
velveth k=35; velvetg -exp_cov auto -min_contig_lgth 300) (Zerbino and Birney 2008). These contigs 
were aligned to the strain-specific genomes using BLAT. Contigs whose 5’ and 3’ ends both aligned to 
the strain-specific genome were retained and extended 100 bp in each direction. These “extended” 
contigs span gaps in each organism’s genomic sequence. Extended contigs were combined with the 
strain-specific genomes to create an intermediate target sequence for gDNA read alignment. Genomic 
DNA was remapped to this intermediate target genome, and non-mappable read-pairs were assembled into 
contigs as above. The resulting contigs were combined with extended contigs and aligned to the dm3 
reference genome using LASTZ (Harris 2007). Contigs that aligned uniquely to the D. melanogaster 
reference genome (dm3) were kept as the “extra-genome”. LiftOver chain files were produced from the 
LASTZ alignment output using the axtChain, chainNet, and netChainSubset utilities from the UCSC 
Genome Browser (Kent et al. 2003).
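
To make the role of the chain files more concrete, the following sketch maps a single position through one chain in the UCSC chain format (a header line followed by "size dt dq" block lines and a final "size" line). It is a simplified illustration, not a replacement for liftOver: it assumes one chain with both strands "+", and the sequence names and numbers in the toy chain are invented.

```python
def map_position(chain_text, pos):
    """Map a 0-based target position to the query through one +/+ chain.

    Returns the query position, or None if pos falls in a gap or
    outside the chain.
    """
    lines = [ln.split() for ln in chain_text.strip().splitlines()]
    header, blocks = lines[0], lines[1:]
    # Header: chain score tName tSize tStrand tStart tEnd
    #         qName qSize qStrand qStart qEnd id
    if header[0] != "chain" or header[4] != "+" or header[9] != "+":
        raise ValueError("this sketch handles only +/+ chains")
    t_pos, q_pos = int(header[5]), int(header[10])
    for fields in blocks:
        size = int(fields[0])
        if t_pos <= pos < t_pos + size:
            return q_pos + (pos - t_pos)  # inside an ungapped aligned block
        t_pos += size
        q_pos += size
        if len(fields) == 3:              # gap sizes before the next block
            t_pos += int(fields[1])
            q_pos += int(fields[2])
    return None


# Toy chain (invented): an 8 bp sequence aligned to a 9 bp sequence that
# carries a 2 bp insertion after position 4 and lacks the final base.
toy_chain = """chain 1000 chrToy 8 + 0 7 chrToy_strain 9 + 0 9 1
4 0 2
3
"""
before_insertion = map_position(toy_chain, 2)
after_insertion = map_position(toy_chain, 5)
deleted_base = map_position(toy_chain, 7)
```

Positions before the insertion keep their coordinate, positions after it are shifted by the inserted length, and the deleted base has no counterpart.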

Using these files
===================

Once liftOver is installed in your path, you can invoke the D. simulans and D. sechellia liftOver 
steps with:

perl sim_lift.pl Sim_testseq_sorted.bed

or 

perl sec_lift.pl Sec_testseq_sorted.bed
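
For orientation, a single liftOver step has the form "liftOver oldFile map.chain newFile unMapped". The sketch below, in Python rather than Perl, only checks the tool and file name requirements stated above and assembles the argument list for one such step. The chain and output file names are hypothetical placeholders, and the sketch does not reproduce the multi-step logic of sim_lift.pl or sec_lift.pl.

```python
import shutil


def liftover_available():
    """Report whether a liftOver executable can be found on the PATH."""
    return shutil.which("liftOver") is not None


def liftover_command(bed_path, chain_path):
    """Build the argument list for one liftOver step.

    The output names (stem + '.lifted.bed' and stem + '.unmapped.bed')
    are illustrative choices, not names used by the bundled scripts.
    """
    if not bed_path.endswith((".bed", ".BED")):
        raise ValueError("BED file name must end with .bed or .BED")
    stem = bed_path[:-4]
    return [
        "liftOver",
        bed_path,                # input coordinates
        chain_path,              # chain file describing the mapping
        stem + ".lifted.bed",    # successfully converted records
        stem + ".unmapped.bed",  # records that could not be converted
    ]


# "sim_to_dm3.chain" is a placeholder name, not a file from this release.
command = liftover_command("Sim_testseq_sorted.bed", "sim_to_dm3.chain")
```

The resulting list could be passed to subprocess.run once liftover_available() returns True.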

Reference
===================

If you use these genomes, please reference:

McManus CJ, Coolon JD, Eipper-Mains J, Wittkopp PJ, and Graveley BR. Evolution of Splicing 
Regulatory Networks in Drosophila. Genome Research (accepted, 2014).

